Correlation between Protein and mRNA Abundance in Yeast

We have determined the relationship between mRNA and protein expression levels for selected genes
expressed in the yeast Saccharomyces cerevisiae growing at mid-log phase. The proteins contained in total yeast
cell lysate were separated by high-resolution two-dimensional (2D) gel electrophoresis. Over 150 protein spots
were excised and identified by capillary liquid chromatography-tandem mass spectrometry (LC-MS/MS).
Protein spots were quantified by metabolic labeling and scintillation counting. Corresponding mRNA levels
were calculated from serial analysis of gene expression (SAGE) frequency tables (V. E. Velculescu, L. Zhang,
W. Zhou, J. Vogelstein, M. A. Basrai, D. E. Bassett, Jr., P. Hieter, B. Vogelstein, and K. W. Kinzler, Cell
88:243-251, 1997). We found that the correlation between mRNA and protein levels was insufficient to predict
protein expression levels from quantitative mRNA data. Indeed, for some genes, while the mRNA levels were
of the same value the protein levels varied by more than 20-fold. Conversely, invariant steady-state levels of
certain proteins were observed with respective mRNA transcript levels that varied by as much as 30-fold.
Another interesting observation is that codon bias is not a predictor of either protein or mRNA levels. Our
results clearly delineate the technical boundaries of current approaches for quantitative analysis of protein
expression and reveal that simple deduction from mRNA transcript analysis is insufficient.



The description of the state of a biological system by the 
quantitative measurement of the system constituents is an es- 
sential but largely unexplored area of biology. With recent 
technical advances including the development of differential 
display-PCR (21), of cDNA microarray and DNA chip tech-
nology (20, 27), and of serial analysis of gene expression
(SAGE) (34, 35), it is now feasible to establish global and 
quantitative mRNA expression profiles of cells and tissues in 
species for which the sequence of all the genes is known. 
However, there is emerging evidence which suggests that 
mRNA expression patterns are necessary but are by them- 
selves insufficient for the quantitative description of biological 
systems. This evidence includes discoveries of posttranscrip-
tional mechanisms controlling the protein translation rate (15),
the half-lives of specific proteins or mRNAs (33), and the 
intracellular location and molecular association of the protein 
products of expressed genes (32). 

Proteome analysis, defined as the analysis of the protein 
complement expressed by a genome (26), has been suggested 
as an approach to the quantitative description of the state of a 
biological system by the quantitative analysis of protein expres-
sion profiles (36). Proteome analysis is conceptually attractive
because of its potential to determine properties of biological
systems that are not apparent by DNA or mRNA sequence
analysis alone. Such properties include the quantity of protein
expression, the subcellular location, the state of modification,
and the association with ligands, as well as the rate of change
with time of such properties. In contrast to the genomes of a
number of microorganisms (for a review, see reference 11) and
the transcriptome of Saccharomyces cerevisiae (35), which have
been entirely determined, no proteome map has been com-
pleted to date.

The most common implementation of proteome analysis is 
the combination of two-dimensional gel electrophoresis (2DE)
(isoelectric focusing-sodium dodecyl sulfate [SDS]-polyacryl-
amide gel electrophoresis) for the separation and quantitation
of proteins with analytical methods for their identification.
2DE permits the separation, visualization, and quantitation of
thousands of proteins reproducibly on a single gel (18, 24). By
itself, 2DE is strictly a descriptive technique. The combination
of 2DE with protein analytical techniques has added the pos- 
sibility of establishing the identities of separated proteins (1, 2) 
and thus, in combination with quantitative mRNA analysis, of 
correlating quantitative protein and mRNA expression mea- 
surements of selected genes. 

The recent introduction of mass spectrometric protein anal-
ysis techniques has dramatically enhanced the throughput and
sensitivity of protein identification to a level which now permits
the large-scale analysis of proteins separated by 2DE. The
techniques have reached a level of sensitivity that permits the
identification of essentially any protein that is detectable in the
gels by conventional protein staining (9, 29). Current protein
analytical technology is based on the mass spectrometric gen-
eration of peptide fragment patterns that are idiotypic for the
sequence of a protein. Protein identity is established by corre-
lating such fragment patterns with sequence databases (10, 22,
37). Sophisticated computer software (8) has automated the
entire process such that proteins are routinely identified with
no human interpretation of peptide fragment patterns.

In this study, we have analyzed the mRNA and protein levels 
of a group of genes expressed in exponentially growing cells of 
the yeast S. cerevisiae. Protein expression levels were quantified
by metabolic labeling of the yeast proteins to a steady state,
followed by 2DE and liquid scintillation counting of the se-
lected, separated protein species. Separated proteins were
identified by in-gel tryptic digestion of spots with subsequent 
analysis by microspray liquid chromatography-tandem mass 
spectrometry (LC-MS/MS) and sequence database searching. 
The corresponding mRNA transcript levels were calculated 
from SAGE frequency tables (35). 

This study, for the first time, explores a quantitative comparison
of mRNA transcript and protein expression levels for a relatively
large number of genes expressed in the same metabolic state. The
resultant correlation is insufficient for prediction of protein
levels from mRNA transcript levels. We have also compared the
relative amounts of protein and mRNA with the respective codon
bias values for the corresponding genes. This comparison indicates
that codon bias by itself is insufficient to accurately predict
either the mRNA or the protein expression levels of a gene. In
addition, the results demonstrate that only highly expressed
proteins are detectable by 2DE separation of total cell lysates and
that therefore the construction of complete proteome maps with
current technology will be very challenging, irrespective of the
type of organism.

FIG. 1. Schematic illustration of proteome analysis by 2DE and mass
spectrometry. In part I, proteins are separated by 2DE, stained
spots are excised and subjected to in-gel digestion with trypsin,
and the resulting peptides are separated by on-line capillary
high-performance liquid chromatography. In part II, a peptide is
shown eluting from the column in part I. The peptide is ionized by
electrospray ionization and enters the mass spectrometer. The mass
of the ionized peptide is detected, and the first quadrupole mass
filter allows only the specific mass-to-charge ratio of the selected
peptide ion to pass into the collision cell. In the collision cell,
the energized, ionized peptides collide with neutral argon gas
molecules. Fragmentation of the peptide is essentially random but
occurs mainly at the peptide bonds, resulting in smaller peptides of
differing lengths (masses). These fragments are detected as a tandem
mass spectrum in which two ion series are recorded simultaneously,
one each from sequencing inward from the N and C termini of the
peptide. The tandem mass spectrum from the selected, ionized peptide
is compared to predicted tandem mass spectra computer generated from
a sequence database. The peptide and, by association, the protein
from which the peptide was derived can thereby be identified.
Unambiguous protein identification is attained in a single analysis
because multiple peptides are identified as being derived from the
same protein.

MATERIALS AND METHODS 

Yeast strains and growth conditions. The source of protein and
transcript for all experiments was YPH499 (MATa ura3-52 lys2-801
ade2-101 leu2-Δ1 his3-Δ200 trp1-Δ63) (30). Logarithmically growing
cells were obtained by growing yeast cells to early log phase
(3 × 10^6 cells/ml) in YPD rich medium (YPD supplemented with 6 mM
uracil, 4.8 mM adenine, and 24 mM tryptophan) at 30°C (35).
Metabolic labeling of protein was accomplished in YPD medium exactly
as described elsewhere (4) with the exception that 1 ml of cells was
labeled with 3 mCi to offset methionine present in YPD medium.
Protein was harvested as described by Garrels and coworkers (12).
Harvested protein was lyophilized, resuspended in isoelectric
focusing gel rehydration solution, and stored at -80°C.

2DE. Soluble proteins were run in the first dimension by using a
commercial flatbed electrophoresis system (Multiphor II; Pharmacia
Biotech). Immobilized polyacrylamide gel (IPG) dry strips with
nonlinear pH 3.0 to 10.0 gradients (Amersham-Pharmacia Biotech) were
used for the first-dimension separation. Forty micrograms of protein
from whole-cell lysates was mixed with IPG strip rehydration buffer
(8 M urea, 2% Nonidet P-40, 10 mM dithiothreitol), and 250 to 380 μl
of solution was added to individual lanes of an IPG strip rehydration
tray (Amersham-Pharmacia Biotech). The strips were allowed to
rehydrate at room temperature for 1 h. The samples were run at
300 V-10 mA-5 W for 2 h, then ramped to 3,500 V-10 mA-5 W over a
period of 3 h, and then kept at 3,500 V-10 mA-5 W for 15 to 19 h. At
the end of the first-dimension run (60 to 70 kV·h), the IPG strips
were reequilibrated for 9 min in 2% (wt/vol) dithiothreitol in 2%
(wt/vol) SDS-6 M urea-30% (wt/vol) glycerol-0.05 M Tris HCl (pH 6.8)
and for [?] min in 2.5% iodoacetamide in 2% (wt/vol) SDS-6 M
urea-30% (wt/vol) glycerol-0.05 M Tris HCl (pH 6.8). Following
reequilibration, the strips were transferred and apposed to 10%
polyacrylamide second-dimension gels. Polyacrylamide gels were
poured in a casting stand with 10% acrylamide-2.67% piperazine
diacrylamide-0.375 M Tris base-HCl (pH 8.8)-0.1% (wt/vol) SDS-0.05%
(wt/vol) ammonium persulfate-0.05% TEMED
(N,N,N',N'-tetramethylethylenediamine) in Milli-Q water. The
apparatus used to run second-dimension gels was a noncommercial
apparatus from Oxford Glycosciences, Inc. Once the IPG strips were
apposed to the second-dimension gels, they were immediately run at
50 mA (constant)-500 V-85 W for 20 min, followed by 200 mA
(constant)-500 V-85 W until the buffer front line was 10 to 15 mm
from the bottom of the gel. Gels were removed and silver stained
according to the procedure of Shevchenko et al. (29).

FIG. 2. 2D silver-stained gel of the proteins in yeast total cell
lysate. Proteins were separated in the first dimension by
isoelectric focusing and in the second dimension by molecular weight
sieving. Protein spots (156) were chosen to include the entire range
of molecular weights, isoelectric focusing points, and staining
intensities. Spots were excised, and the corresponding protein was
identified by mass spectrometry and database searching. The spots
are labeled on the gel and correspond to the data presented in
Table 1. Molecular weights are given in thousands.

Protein identification. Gels were exposed to X-ray film overnight,
and then the silver staining and film were used to excise 156 spots
of varying intensities, molecular weights, and isoelectric focusing
points. In order to increase the detection limit by mass
spectrometry, spots were cut out and pooled from up to four
identical cold, silver-stained gels. In-gel tryptic digests of
pooled spots were performed as described previously (29). Tryptic
peptides were analyzed by microcapillary LC-MS with automated
switching to MS/MS mode for peptide fragmentation. Spectra were
searched against the composite OWL protein sequence database
(version 30.2; 250,314 protein sequences) by using the computer
program Sequest (8), which matches theoretical and acquired tandem
mass spectra. A protein match was determined by comparing the number
of peptides identified and their respective cross-correlation
scores. All protein identifications were verified by comparison with
theoretical molecular weights and isoelectric points.



mRNA quantitation. Velculescu and coworkers have previously
generated frequency tables for yeast mRNA transcripts from the same
strain grown under the same stated conditions as described herein
(35). The SAGE technology is based on two main principles. First, a
short sequence tag (15 bp) that contains sufficient information
uniquely to identify a transcript is generated. A single tag is
usually generated from each mRNA transcript in the cell which
corresponds to 15 bp at the 3'-most cutting site for NlaIII. Second,
many transcript tags can be concatenated into a single molecule and
then sequenced, revealing the identity of multiple tags
simultaneously. Over 20,000 transcripts were sequenced from yeast
strain YPH499 growing at mid-log phase on glucose. Assuming the
previously derived estimate of 15,000 mRNA molecules per cell (16),
this would represent a 1.3-fold coverage even for mRNA molecules
present at a single copy per cell and would provide a 72%
probability of detecting such transcripts. Computer software which
took for input the gene detected, extracted the nucleotide sequence,
and performed the calculation as described by Velculescu and
coworkers (35) was written. In practice, we found that for 21 of 128
(16%) genes examined viable mRNA levels from SAGE data could not be
calculated. This was because (i) no CATG site was found in the open
reading frame (ORF), (ii) a CATG site was found but the
corresponding 15-bp putative SAGE tag was not found in the frequency
tables, or (iii) identical putative SAGE tags were present for
multiple genes (e.g., TDH2_YEAST and TDH3_YEAST).
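
The tag lookup and the coverage estimate described in this paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software written for the study: the function names are hypothetical, the tag is assumed to begin at the 3'-most CATG and to lie entirely within the supplied ORF sequence, and the detection probability assumes that each sequenced tag is an independent draw from the cellular mRNA pool.

```python
from typing import Optional


def putative_sage_tag(orf: str, tag_len: int = 15) -> Optional[str]:
    """Return the putative SAGE tag of an ORF, or None if none can be derived.

    Assumption: the tag is the tag_len bases starting at the 3'-most
    CATG (the NlaIII recognition site) and must fit inside the sequence.
    """
    seq = orf.upper()
    site = seq.rfind("CATG")  # position of the 3'-most NlaIII site
    if site == -1:
        return None  # corresponds to case (i): no CATG site in the ORF
    tag = seq[site:site + tag_len]
    return tag if len(tag) == tag_len else None


def detection_probability(n_tags: int, n_mrna_per_cell: int) -> float:
    """Chance that a single-copy transcript is sampled at least once."""
    return 1.0 - (1.0 - 1.0 / n_mrna_per_cell) ** n_tags


# Figures quoted in the text: >20,000 tags, 15,000 mRNA molecules per cell.
coverage = 20_000 / 15_000
p_detect = detection_probability(20_000, 15_000)
print(f"coverage = {coverage:.2f}-fold, P(detect) = {p_detect:.2f}")
```

With exactly 20,000 tags this simple model gives about 0.74, close to the 72% quoted above; the small difference presumably reflects the exact tag count or rounding used by the authors. Cases (ii) and (iii) would be handled by looking the returned tag up in the frequency table and checking whether more than one gene maps to it.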
TABLE 1. Expressed genes identified from 2D gel in Fig. 2

Mol wt      pI       Spot  YPD gene  Protein   mRNA      Codon
                     no.   name(a)   (10^3     (copies/  bias
                                     copies/   cell)
                                     cell)

17,259      6.75     133   CPR1      15.3      61.7      0.769
18,702      4.80     83    EGD2      20.1      5.2       0.724
18,726      4.44     147   YKL056C   61.2      88.4      0.831
18,978      5.95     135   YER067W   3.7       6.7       0.118
19,108      5.04     130   YLR109W   94.4      9.7       0.680
19,691      9.08     136   ATP7      11.0      NA(c)     0.246
20,405      6.07     111   GUK1      16.5      3.7       0.422
21,444      5.25     148   SAR1      5.4       10.4      0.455
21,?83      4.98     95    TSA1      110.6     40.1      0.845
22,602      4.30     80    EFB1      66.1      23.8      0.875
23,019      6.29     112   SOD2      12.6      2.3       0.351
23,743      5.44     137   HSP26     NA        0.7       0.434
24,033      5.97     96    ADK1      17.4      16.4      0.656
24,058      4.43     143   YKL117W   29.3      10.4      0.339
24,353      6.30     140   TFS1      8.1       0.7       0.146
24,662      5.85     99    URA5      25.4      6.0       0.359
24,808      6.33     97    GSP1      26.3      5.3       0.735
24,908      8.73     122   RPS5      18.6      NA(c)     0.899
25,081      4.65     81    MRP8      9.3       NA(c)     0.241
25,960      6.06     116   RPE1      5.4       0.7       0.372
26,378      9.55     127   RPS3      96.8      NA(c)     0.863
26,4?7      5.18     100   VMA4      10.3      3.7       0.427
26,661      5.84     98    TPI1      NA        NA(c)     0.900
27,156      5.?6     93    PRE8      6.9       0.7       0.129
27,334      ?.13     115   YHR049W   18.4      2.2       0.320
27,472      5.?3     92    YNL010W   31.6      3.7       0.421
27,480      8.95     123   GPM1      10.0      169.4     0.902
27,480      8.95     124   GPM1      231.4     169.4     0.902
27,480      8.95     125   GPM1      7.5       169.4     0.902
27,809      5.97     139   HOR2      5.7       0.7
27,874      4.46     78    YST1      13.6      52.8
28,495      4.31     41    PUP2      4.4       0.7
29,156      6.59     114   YMR226C   14.5      2.2
29,244      8.40     120   DPM1      5.0       11.2
29,443      5.91     48    PRE4      3.4       3.7
30,012      6.39     138   PRB1      21.2      1.5
30,073      4.63     77    BMH1      14.7      28.2
30,?96      7.94     121   OMP2      67.4      41.6
30,435      6.34     89    GPP1      70.2      11.2
31,432      5.57     88    ILV6      13.9      3.0
32,159      5.46     113   IPP1      63.1      3.7
32,263      6.00     149   HIS1      22.4      4.5
33,411      5.35     84    SPE3      15.1      6.7
34,465      5.60     129   ADE1      8.7       5.2
34,762      5.32     85    SEC14     10.9      6.0
34,797      5.45     42    URA1      49.3      8.9
34,799      ?.04     90    BEL1      103.2     81.0
35,456      5.97     43    YDL124W   6.4       4.4
35,?19      ?        59    TDH1      69.8      32.?
35,650      5.49     68    CAR1      5.2       3.0
35,712      6.72     117   TDH2      49.6      473.0(c)
35,712      6.72     154   TDH2      863.5     473.0(c)
35,712      6.72     155   TDH3      79.4      473.0(c)
36,272      4.85     128   APA1      8.7       0.7
36,358      5.05     75    YJR105W   17.6      17.1
36,358      5.05     76    YJR105W   27.5      17.1
36,596      6.37     79    ADH2      58.9      260.0(c)
36,714      6.30     102   ADH1      746.1     260.0
36,714      6.30     103   ADH1      17.6      260.0
36,714      6.30     104   ADH1      61.4      260.0
36,714      6.30     105   ADH1      52.7      260.0
37,033      6.23     44    TAL1      44.4      3.7
37,796      7.36     57    IDH2      29.4      6.7
37,886      6.49     106   ILV5      76.0      4.3
38,700      7.83     55    BAT1      30.9      11.2
38,702      6.24     46    QCR2      NA        2.2
39,477      5.58     86    FBA1      17.8      183.6     0.935
39,477      5.58     87    FBA1      427.3     183.6     0.935
39,540      6.50     150   HOM2      60.3      4.3       0.392
39,56?      6.12     156   PSA1      96.4      27.4      0.718
41,158      6.01     49    YNL134C   14.9      1.4       0.316
41,623      7.18     56    BAT2      19.0      8.9       0.350
41,728      7.29     110   ERG10     24.1      4.3       0.443
41,900      5.42     74    TOM40     22.3      2.3       0.475
42,402      6.29     45    CYS3      6.7       8.9       0.621
42,6?3      5.63     67    DYS1      15.8      5.3       0.426
43,409      6.41     107   SER1      10.3      1.3       0.392
43,42?      5.49     91    ERG6      2.2       14.1      0.408
44,174      7.42     56    YBR025C   13.1      6.0       0.684
44,652      4.99     72    TIF1      2.9       39.4      0.434
44,707      7.77     108   PGK1      23.7      165.7     0.497
44,707      7.77     109   PGK1      315.2     165.7     0.497
46,080      6.72     30    CAR2      15.4      NA(c)     0.495
46,383      8.32     53    IDP1      7.7       0.7       0.436
46,453      5.98     47    IDP2      32.4      NA(c)     0.197
46,679      6.39     50    ENO1      35.4      0.7       0.930
46,679      6.39     51    ENO1      6.6       0.7       0.930
46,679      6.39     52    ENO1      2.2       0.7       0.930
46,773      5.82     63    ENO2      15.5      289.1     0.960
46,773      5.82     64    ENO2      635.5     289.1     0.960
46,773      5.82     65    ENO2      93.0      289.1     0.960
46,773      5.82     66    ENO2      31.0      289.1     0.960
47,402      6.09     126   COR1      2.5       0.7       0.422
47,666      6.98     54    AAT2      11.7      6.0       0.438
48,364
TABLE 1-Continued

Mol wt      pI       Spot  YPD gene  Protein   mRNA      Codon
                     no.   name(a)   (10^3     (copies/  bias
                                     copies/   cell)
                                     cell)

61,903      6.21     101   PDC5      ?.3       NA(c)     0.81?
62,266      6.19     16    ICL1      20.1      4.5       0.?27
62,862      ?.02     19    ILV3      5.3       3.0       0.34?
63,0?2      ?.40     119   PGM2      2.2       1.5       0.402
64,335      ?.77     5     PAB1      30.4      0.7       0.616
66,120      5.42     8     STI1      6.7       0.7       0.313
66,120      5.42     9     STI1      6.4       0.7       0.313
66,450      5.29     141   SSB2      7.0       NA(c)     0.880
66,450      5.29     142   SSB2      1.5       NA(c)     0.880
66,456      5.23     10    SSB1      64.5      79.5      0.907
66,456      5.23     11    SSB1      59.0      79.5      0.907
66,456      5.23     12    SSB1      11.7      79.5      0.907
68,?97      5.?2     82    LEU4      3.1       3.0       0.407
69,313      4.90     13    SSA2      24.3      18.6      0.892
69,313      4.90     14    SSA2      77.1      18.6      0.892
74,378      ?.46     15    YKL029C   1.8       3.7       0.353
75,396      5.?2     6     GRS1      5.5       7.4       0.300
85,720      6.25     1     MET6      2.0       NA(c)     0.772
85,720      6.25     2     MET6      10.9      NA(c)     0.772
85,720      6.25     3     MET6      1.4       NA(c)     0.772
93,276      6.11     131   EFT1      17.9      41.6      0.?90
93,276      6.11     132   EFT1      5.7       41.6      0.?90
102,064(e)  6.61(e)  94    ADE3      4.8       5.2       0.423
107,???(e)  5.33(e)  4     MCM3      2.7       NA(c)     0.240

(a) YPD gene names are available from the YPD website (39).
(b) NA, calculation could not be performed or was not available.
(c) mRNA data inconclusive or NA.
(d) No methionines in predicted ORF; therefore, protein concentration
    was not determined.
(e) Measured molecular weight or pI did not match theoretical
    molecular weight or pI.



Protein quantitation. [35S]methionine-labeled gels were exposed to
X-ray film overnight, and then the silver stain and film were used to
excise 156 spots of varying intensities, molecular weights, and pIs.
The excised spots were placed in 0.6-ml microcentrifuge tubes, and
scintillation cocktail (100 μl) was added. The samples were vortexed
and counted. In addition, two parallel gels were electroblotted to
polyvinylidene difluoride membranes. The membranes were exposed to
X-ray film, and four intense single spots were excised from each
membrane and subjected to amino acid analysis. For these four spots,
a mean of 209 ± 4 cpm/pmol of protein/methionine was found. This
number was used to quantitate all remaining spots in conjunction with
the number of methionines present in the protein.


To ensure that proteins were labeled to equilibrium, parallel 2D gels
were prepared and run on yeast metabolically labeled for 1, 2, 6, or
18 h. The corresponding 156 spots were excised from each gel, and
radioactivity was measured by liquid scintillation counting for each
spot. Calculated protein levels were highly reproducible for all time
points measured after 1 h.

Calculation of codon bias and predicted half-life. Codon bias values
were extracted from the YPD spreadsheet (17). Protein half-lives were
calculated based on the N-end rule (33). When the N-terminal
processing was not known experimentally, it was predicted based on
the affinity of methionine aminopeptidase (31).



RESULTS 

Characteristics of proteome approach. Nearly every facet of
proteome analysis hinges on the unambiguous identification of
large numbers of expressed proteins in cells. Several techniques
have been described previously for the identification of proteins
separated by 2DE, including N-terminal and internal sequencing
(1, 2), amino acid analysis (38), and more recently mass
spectrometry (25). We utilized techniques based on mass
spectrometry because they afford the highest levels of sensitivity
and provide unambiguous identification. The specific procedure used
is schematically illustrated in Fig. 1 and is based on three
principles. First, proteins are removed from the gel by proteolytic
in-gel digestion, and the resulting peptides are separated by
on-line capillary high-performance liquid chromatography. Second,
the eluting peptides are ionized and detected, and the specific
peptide ions are selected and fragmented by the mass spectrometer.
To achieve this, the mass spectrometer switches between the MS mode
(for peptide mass identification) and the MS/MS mode (for peptide
characterization and sequencing). Selected peptides are fragmented
by a process called collision-induced dissociation (CID) to generate
a tandem mass spectrum (MS/MS spectrum) that contains the peptide
sequence information. Third, individual CID mass spectra are then
compared by computer algorithms to predicted spectra from a sequence
database. This results in the identification of the peptide and, by
association, the protein(s) in the spot. Unambiguous protein
identification is attained in a single analysis by the detection of
multiple peptides derived from the same protein.
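
The third principle, comparison of an acquired CID spectrum with spectra predicted from database sequences, can be illustrated with a minimal sketch. It is not the Sequest algorithm, which scores candidates by cross-correlation; it only computes the singly charged b- and y-ion series of a candidate peptide from standard monoisotopic residue masses and counts how many predicted ions fall near an observed peak. The function names and the 0.5-Da tolerance are assumptions made for illustration.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to y ions
PROTON = 1.00728   # charge carrier for singly protonated fragments


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of m/z values for charge +1."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # N-terminal fragments b1 .. b(n-1)
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # C-terminal fragments y1 .. y(n-1)
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def count_matches(observed_mz, predicted_mz, tol=0.5):
    """Count predicted ions lying within tol of any observed peak."""
    return sum(
        any(abs(obs - pred) <= tol for obs in observed_mz)
        for pred in predicted_mz
    )


# One of the tryptic peptides reported for YMR226C in this study.
b_series, y_series = fragment_ions("FAVGAFTDSLR")
```

A database search repeats this prediction for every candidate peptide of appropriate mass and keeps the best-scoring sequence; several peptides pointing to the same protein then give the unambiguous identification described above.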

Protein identification. Yeast total cell protein lysate (40 μg),
metabolically labeled with [35S]methionine, was electro-
phoretically separated by isoelectric focusing in the first dimen-
sion and by SDS-10% polyacrylamide gel electrophoresis in
the second dimension. Proteins were visualized by silver stain-
ing and by autoradiography. Of the more than 1,000 proteins
visible by silver staining, 156 spots were excised from the gel
and subjected to in-gel tryptic digestion, and the resulting
peptides were analyzed and identified by microspray LC-
MS/MS techniques as described above. The proteins in this
study were all identified automatically by computer software
with no human interpretation of mass spectra. They are indi-
cated in Fig. 2 and detailed in Table 1.

The CID spectra shown in Fig. 3 indicate that the quality of
the identification data generated was suitable for unambiguous
protein identification. The spectra represent the amino acid
sequences of tryptic peptides NSGDIVNLGSIAGR (Fig. 3A)
and FAVGAFTDSLR (Fig. 3B). Both peptides were derived
from protein S57593 (hypothetical protein YMR226C), which
migrated to spot 114 (molecular weight, 29,156; pI, 6.59) in the
2D gel in Fig. 2. Five other peptides from the same analysis
were also computer matched to the same protein sequence.

Protein and mRNA quantitation. For the 156 genes investi-
gated, the protein expression levels ranged from 2,200 (PGM2)
to 863,000 (TDH2/TDH3) copies/cell. The levels of mRNA for
each of the genes identified were calculated from SAGE fre-
quency tables (35). These tables contain the mRNA levels for
4,665 genes in yeast strain YPH499 grown to mid-log phase in
YPD medium on glucose as a carbon source. In some in-
stances, the mRNA levels could not be calculated for reasons
stated in Materials and Methods. For the proteins analyzed in
this study, mean transcript levels varied from 0.7 to 473 copies/
cell.
Selection of the sample population for mRNA-protein ex-
pression level correlation. The protein spots selected for iden-
tification were selected from spots visible by silver staining in
the 2D gel. An attempt was made not to include spots where
overlap with other spots was readily apparent. The number of
proteins identified was 156 (Table 1). Some proteins migrated
to more than one spot (presumably due to differential protein
processing or modifications), and protein levels from these
spots were calculated by integrating the intensities of the dif-
ferent spots. The 156 protein spots analyzed represented the
products of 128 different genes. Genes were excluded from the
correlation analysis only if part of the data set was missing; i.e.,
genes were excluded if (i) no mRNA expression data were
available for the protein or putative SAGE tags were ambig-
uous, (ii) the amino acid sequence did not contain methionine,
(iii) more than a single protein was conclusively identified as

gene ORFs is presented in Fig. 4A. The interval with the
largest frequency of genes is between the codon bias values of
0.0 and 0.1. This segment contains more than 2,500 genes. The
distribution of the codon bias values of the 128 different genes
found in this study (all protein spots from Fig. 2) is shown in
Fig. 4B, and protein half-lives (predicted from applying the
N-end rule [33] to the experimentally determined or predicted
protein N termini) are shown in Fig. 4C. No genes were iden-
tified with codon bias values less than 0.1 even though thou-
sands of genes exist in this category. In addition, nearly all of
the proteins identified had long predicted half-lives (greater
than 30 h).

Correlation of mRNA and protein expression levels. The
correlation between mRNA and protein levels of the genes
selected as described above is shown in Fig. 5. For the entire
group (106 genes) for which a complete data set was gener-
ated, there was a general trend of increased protein levels
resulting from increased mRNA levels. The Pearson product
moment correlation coefficient for the whole data set (106
genes) was 0.935. This number is highly biased by a small
number of genes with very large protein and message levels. A
more representative subset of the data is shown in the inset of
Fig. 5. It shows genes for which the message level was below 10
copies/cell and includes 69% (73 of 106 genes) of the data used
in the study. The Pearson product moment correlation coeffi-
cient for this data set was only 0.356. We also found that levels
of protein expression coded for by mRNA with comparable
abundance varied by as much as 30-fold and that the mRNA
levels coding for proteins with comparable expression levels
varied by as much as 20-fold.
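
The sensitivity of the Pearson coefficient to a few very abundant genes can be reproduced with a short sketch. The numbers below are invented purely for illustration and are not data from this study; the function name is likewise an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson product moment correlation coefficient of two sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical low-abundance genes: mRNA and protein are poorly related.
low_mrna = [1, 2, 3, 4]
low_protein = [4, 1, 3, 2]

# Adding a single gene that is very abundant at both levels.
all_mrna = low_mrna + [100]
all_protein = low_protein + [100]

print(pearson(low_mrna, low_protein))
print(pearson(all_mrna, all_protein))
```

In this toy set the four low-abundance genes are weakly and negatively correlated, yet one extreme point drives the coefficient for the combined set close to 1, which is the same kind of bias described for the full 106-gene data set.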

The distortion of the correlation value induced by the un-
even distribution of the data points along the y axis is further
demonstrated by the analysis in Fig. 6. The 106 samples in-
cluded in the study were ranked by protein abundance, and the
Pearson product moment correlation coefficient was repeat-
edly calculated after including progressively more, and higher-
abundance, proteins in each calculation. The correlation values
remained relatively stable in the range of 0.1 to 0.4 if the
lowest-expressed 40 to 95 proteins used in this study were
included. However, the correlation value steadily climbed by
the inclusion of each of the 11 very highly expressed proteins.
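
The ranked, cumulative calculation described in this paragraph can be sketched as below. This is an assumed reconstruction of the procedure from its description, with hypothetical function names and made-up example values; it is not the authors' code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def cumulative_correlation(pairs, start=3):
    """Rank (mRNA, protein) pairs by protein abundance and return a list of
    (k, r) where r is the correlation over the k least abundant proteins."""
    ranked = sorted(pairs, key=lambda pair: pair[1])
    results = []
    for k in range(start, len(ranked) + 1):
        subset = ranked[:k]
        r = pearson([m for m, _ in subset], [p for _, p in subset])
        results.append((k, r))
    return results


# Hypothetical (mRNA copies/cell, protein 10^3 copies/cell) pairs.
example = [(1, 4), (2, 1), (3, 3), (4, 2), (100, 100)]
for k, r in cumulative_correlation(example):
    print(k, round(r, 3))
```

Plotting r against k for real data gives a curve of the kind described for Fig. 6: roughly flat while only low-abundance proteins are included, then rising as the most abundant proteins enter the calculation.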

Correlation of protein and mRNA expression levels with
codon bias. Codon bias is the propensity for a gene to utilize
the same codon to encode an amino acid even though other
codons would insert the identical amino acid in the growing
polypeptide sequence. It is further thought that highly ex-
pressed proteins have large codon biases (3). To assess the
value of codon bias for predicting mRNA and protein levels in
exponentially growing yeast cells, we plotted the two experi-
mental sets of data versus the codon bias (Fig. 7). The distri-
bution patterns for both mRNA and protein levels with respect
to codon bias were highly similar. There was high variability in
the data within the codon bias range of 0.8 to 1.0. Although a
large codon bias generally resulted in higher protein and mes-
sage expression levels, codon bias did not appear to be predic-
tive of either protein levels or mRNA levels in the cell.

DISCUSSION 

The desired end point for the description of a biological
system is not the analysis of mRNA transcript levels alone but
also the accurate measurement of protein expression levels and
their respective activities. Quantitative analysis of global
mRNA levels currently is a preferred method for the analysis
of the state of cells and tissues (11). Several methods which
either provide absolute mRNA abundance (34, 35) or relative
mRNA levels in comparative analyses (20, 27) have been de-
scribed elsewhere. The techniques are fast and exquisitely sen-
sitive and can provide mRNA abundance for potentially any
expressed gene. Measured mRNA levels are often implicitly or
explicitly extrapolated to indicate the levels of activity of the
corresponding protein in the cell. Quantitative analysis of pro-
tein expression levels (proteome analysis) is much more time-
consuming because proteins are analyzed sequentially one by
one and is not general because analyses are limited to the
relatively highly expressed proteins. Proteome analysis does,
however, provide types of data that are of critical importance
for the description of the state of a biological system and that
are not readily apparent from the sequence and the level of
expression of the mRNA transcript. This study attempts to
examine the relationship between mRNA and protein expres-
sion levels for a large number of expressed genes in cells
representing the same state.

FIG. 4. Current proteome analysis technology utilizing 2DE without
preenrichment samples mainly highly expressed and long-lived
proteins. Genes encod[...] on codon bias. No genes with codon bias
values less than 0.1 were detected in the study. (C) Distribution of
identified proteins in this study based on predicted half-life
(estimated by N-end rule).

Limits in the sensitivity of current protein analysis technology
precluded a completely random sampling of yeast proteins.
We therefore based the study on those proteins visible by silver
staining on a 2D gel. Of the more than 1,000 visible spots, 156
were chosen to include the entire range of molecular weights,
isoelectric focusing points, and staining intensities displayed on
the 2D protein pattern. The genes identified in this study
shared a number of properties. First, all of the proteins in this
study had a codon bias of greater than 0.1 and 93% were
greater than 0.2 (Fig. 4B). Second, with few exceptions, the
proteins in this study had long predicted half-lives according to
the N-end rule (Fig. 4C). Third, low-abundance proteins with
regulatory functions such as transcription factors or protein
kinases were not identified.

Because the population of proteins used in this study appears
to be fairly homogeneous with respect to predicted half-life
and codon bias, it might be expected that the correlation of
the mRNA and protein expression levels would be stronger for
this population than for a random sample of yeast proteins. We
tested this assumption by evaluating the correlation value when
different subsets of the available data were included in the
calculation. The 106 proteins were ranked from lowest to highest
protein expression level, and the trend in the correlation
value was evaluated by progressively including more of the
higher-abundance proteins in the calculation (Fig. 6). The
correlation value when only the lower-abundance 40 to 93
proteins were examined was consistently between 0.1 and 0.4. If
the 11 most abundant proteins were included, the correlation
steadily increased to 0.94. We therefore expect that the
correlation for all yeast proteins or for a random selection would be
less than 0.4. The observed level of correlation between
mRNA and protein expression levels suggests the importance
of posttranslational mechanisms controlling gene expression.
Such mechanisms include translational control (15) and
control of protein half-life (33). Since these mechanisms are also
active in higher eukaryotic cells, we speculate that there is no
predictive correlation between steady-state levels of mRNA
and those of protein in mammalian cells.
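
The subset analysis described above (rank the proteins by abundance, then recompute the correlation as higher-abundance proteins are added) can be sketched as follows. This is an illustrative sketch only, not the analysis code used in the study: the function names, the choice of the Pearson coefficient, and the synthetic demonstration data are all assumptions made here.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient; returns NaN if either variable is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def cumulative_correlation(mrna, protein, start=40):
    """Rank genes by protein level (lowest first) and return (k, r) pairs,
    where r is the correlation over the k lowest-abundance proteins."""
    ranked = sorted(zip(protein, mrna))  # sort by protein abundance
    results = []
    for k in range(start, len(ranked) + 1):
        subset_protein = [p for p, _ in ranked[:k]]
        subset_mrna = [m for _, m in ranked[:k]]
        results.append((k, pearson(subset_mrna, subset_protein)))
    return results


if __name__ == "__main__":
    # Synthetic, hypothetical data (NOT the values measured in the study):
    # many low-abundance genes with unrelated mRNA and protein levels,
    # plus a few genes that are high in both.
    random.seed(0)
    low_protein = [random.lognormvariate(9, 0.5) for _ in range(95)]
    low_mrna = [random.lognormvariate(1, 0.8) for _ in range(95)]
    high_protein = [random.lognormvariate(13, 0.3) for _ in range(11)]
    high_mrna = [random.lognormvariate(5, 0.3) for _ in range(11)]
    for k, r in cumulative_correlation(low_mrna + high_mrna,
                                       low_protein + high_protein):
        print(k, round(r, 2))
```

Plotting the returned (k, r) pairs gives a curve of the kind described for Fig. 6, and it shows how a handful of very abundant genes can dominate a correlation coefficient computed over the whole data set.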

Like other large-scale analyses, the present study has several 
potential sources of error related to the methods used to de- 
termine mRNA and protein expression levels. The mRNA 
levels were calculated from frequency tables of SAGE data. 
This method is highly quantitative because it is based on actual 
sequencing of unique tags from each gene, and the number of 
times that a tag is represented is proportional to the number of
mRNA molecules for a specific gene. This method has some
limitations including the following: (i) the magnitude of the
error in the measurement of mRNA levels is inversely
proportional to the mRNA levels, (ii) SAGE tags from highly similar
genes may not be distinguished and therefore are summed, (iii)
some SAGE tags are from sequences in the 3' untranslated
region of the transcript, (iv) incomplete cleavage at the SAGE
tag site by the restriction enzyme can result in two tags
representing one mRNA, and (v) some transcripts actually do not
generate a SAGE tag (34, 35).

For the SAGE method, the error associated with a value
increases with a decreasing number of transcripts per cell. The
conclusions drawn from this study are dependent on the quality
of the mRNA levels from previously published data (35).
Since more than 65% of the mRNA levels included in this
study were calculated to be 10 copies/cell or less (40% were less
than 4 copies/cell), the error associated with these values may
be quite large. The mRNA levels were calculated from more
than 20,000 transcripts. Assuming that the estimate of 15,000
mRNA molecules per cell is correct (16), this would mean that
mRNA transcripts present at only a single copy per cell would
be detected 72% of the time (35). The mRNA levels for each
gene were carefully scrutinized, and only mRNA levels for
which a high degree of confidence existed were included in the
correlation value.
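
The dependence of SAGE detection and counting error on transcript abundance can be illustrated with a simple sampling model. The sketch below is an assumption made here, not the calculation used in reference 35: it treats each sequenced tag as an independent draw from the mRNA pool, assumes every transcript yields a tag, and uses Poisson counting statistics for the error. The function names are hypothetical.

```python
import math


def detection_probability(copies_per_cell, tags_sequenced, mrnas_per_cell):
    """Probability of seeing at least one tag for a transcript, assuming each
    sequenced tag is an independent draw from the cellular mRNA pool."""
    p_tag = copies_per_cell / mrnas_per_cell
    return 1.0 - (1.0 - p_tag) ** tags_sequenced


def relative_counting_error(copies_per_cell, tags_sequenced, mrnas_per_cell):
    """Relative standard deviation of the tag count under a Poisson model."""
    expected_tags = tags_sequenced * copies_per_cell / mrnas_per_cell
    return 1.0 / math.sqrt(expected_tags)


if __name__ == "__main__":
    # Values quoted in the text: 20,000 tags and 15,000 mRNA molecules per cell.
    for copies in (1, 4, 10, 100):
        p = detection_probability(copies, 20_000, 15_000)
        err = relative_counting_error(copies, 20_000, 15_000)
        print(copies, round(p, 3), round(err, 3))
```

Under these assumptions a single-copy transcript is detected with a probability of about 0.74, close to the 72% figure cited from reference 35, and the relative counting error shrinks as the expected tag count grows, consistent with the larger uncertainty for low-copy transcripts.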

Protein abundance was determined by metabolic radiolabeling
with [35S]methionine. The calculation required knowledge
of three variables: the number of methionines in the mature
protein, the radioactivity contained in the protein, and the
specific activity of the radiolabel normalized per methionine.
The number of methionines per protein was determined from
the amino acid sequence of the proteins identified by tandem
mass spectrometry. For some proteins, it was not known
whether the methionine of the nascent polypeptide was
processed away. The N termini of those proteins were predicted
based on the specificity of methionine aminopeptidase (31). If
the N-terminal processing did not conform to the predicted
specificity of processing enzymes, the calculation of the
number of methionines would be affected. This discrepancy would
affect most the quantitation of a protein with a very low
number of methionines. The average number of calculated
methionines per protein in this study was IX. We therefore expect
the potential for erroneous protein quantitation due to
unusual N-terminal processing to be small.
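
The three-variable calculation described above, and its sensitivity to a miscounted methionine, can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure from Materials and Methods: the units (counts per minute per mole of methionine), the normalization by the number of cells loaded, and all function and parameter names are hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def protein_copies_per_cell(spot_cpm, n_met, cpm_per_mol_met, cells_loaded):
    """Estimate protein copies per cell from the radioactivity of a gel spot.

    spot_cpm        -- radioactivity measured in the spot (counts per minute)
    n_met           -- number of methionines in the mature protein
    cpm_per_mol_met -- specific activity per mole of methionine (assumed unit)
    cells_loaded    -- number of cells represented on the gel (assumed input)
    """
    mol_protein = spot_cpm / (cpm_per_mol_met * n_met)
    return mol_protein * AVOGADRO / cells_loaded


def estimate_ratio_if_miscounted(n_met_assumed, n_met_true):
    """Ratio of estimated to true abundance when the methionine count is wrong.

    Abundance scales as 1 / n_met, so the ratio is n_met_true / n_met_assumed.
    """
    return n_met_true / n_met_assumed


if __name__ == "__main__":
    # Initiator methionine assumed retained but actually removed:
    print(estimate_ratio_if_miscounted(2, 1))    # protein with few methionines
    print(estimate_ratio_if_miscounted(10, 9))   # protein with many methionines
```

A one-methionine miscount halves the estimate for a protein assumed to carry two methionines but changes it by only 10% for a protein assumed to carry ten, which is why the error matters most for proteins with very few methionines.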



The amount of radioactivity contained in a single spot might 
be the sum of the radioactivity of comigrating proteins. Be- 
cause protein identification was based on tandem mass spec- 
trometric techniques, comigrating proteins could be identified. 
However, comigrating proteins were rarely detected in this
study, most likely because relatively small amounts of total
protein (40 µg) were initially loaded onto the gels, which
resulted in highly focused spots containing generally 1 to 25 ng of
protein. Because of the relatively small amount loaded, the
concentrations of any potentially comigrating protein would
likely be below the limit of detection of the mass spectrometry
technique used in this study (1 to 5 ng) and below the limit of
visualization by silver staining (1 to 5 ng). In the overwhelming
majority of the samples analyzed, numerous peptides from a 
single protein were detected. It is assumed that any comigrat- 
ing proteins were at levels too low to be detected and that their 
influence in the calculation would be small. 

The specific activity of the radiolabel was determined by 
relating the precise amount of protein present in selected spots 
of a parallel gel, as determined by quantitative amino acid 
composition analysis, to the number of methionines present in 
the sequence of those proteins and the radioactivity deter- 
mined by liquid scintillation counting. It is possible that the 
resulting number might be influenced by unavoidable losses 
inherent in the amino acid analysis procedure applied. Because
four different proteins were utilized in the calculation and the 
experiment was done in duplicate, the specific activity
calculated is thought to be highly accurate. Indeed, the specific
activities calculated for each of the four proteins varied by less
than 10%. Any inconsistencies in the calculation of the specific
activity would result in differences in the absolute levels
calculated but not in the relative numbers and would therefore not
influence the correlation value determined.

FIG. 7. Relationship between codon bias and protein and mRNA levels
in this study. Yeast mRNA and protein expression levels were
calculated as described in Materials and Methods. The data represent
the same 106 spots as in Fig. 5.

The protein quantitative method used eliminates a number 
of potential errors inherent in previous methods for the quan- 
titation of proteins separated by 2DE, such as preferential 
protein staining and bias caused by inequalities in the number
of radiolabeled residues per protein. Any 2D gel-based method
of quantitation is complicated by the fact that in some cases the
translation products of the same mRNA migrated to different
spots. One major reason is posttranslational modification or
processing of the protein. Also, artifactual proteolysis during
cell lysis and sample preparation can lead to multiple resolved
forms of the protein. In such cases, the protein levels of spots
coded for by the same mRNA were pooled. In addition, the
existence of other spots coded for by the same mRNA that
were not analyzed by mass spectrometry or that were below the
limit of detection for silver staining cannot be ruled out.
However, since this study is based on a class of highly expressed
proteins, the presence of undetected minor spots below silver
staining sensitivity corresponding to a protein analyzed in the
study would generally cause a relatively small error in protein
quantitation.

Codon bias is a measure of the propensity of an organism to 
selectively utilize certain codons which result in the incorpo- 
ration of the same amino acid residue in a growing polypeptide 
chain. There are 61 possible codons that code for 20 amino 
acids. The larger the codon bias value, the smaller the number 
of codons that are used to encode the protein (19). It is
thought that codon bias is a measure of protein abundance
because highly expressed proteins generally have large codon
bias values (3, 13).

Nearly all of the most highly expressed proteins had codon 
bias values of greater than 0.8. However, we detected a number 
of genes with high codon bias and relatively low protein
abundance (Fig. 7). For example, the expressed gene with both the
second largest protein and mRNA levels in the study was
ENO2_YEAST (775,000 and 269.1 copies/cell, respectively).
ENO1_YEAST was also present in the gel at much lower
protein and mRNA levels (44,200 and 0.7 copies/cell,
respectively). The codon bias values for ENO2 and ENO1 are similar
(0.96 and 0.93, respectively), but the expression of the two
genes is differentially regulated. Specifically, ENO1_YEAST is
glucose repressed (6) and was therefore present in low
abundance under the conditions used. Other genes with large codon
bias values that were not of high protein abundance in the gel
include EFT1, TIF1, HXK2, GSP1, EGD2, SHM2, and TAL1.
We conclude that merely determining the codon bias of a gene
is not sufficient to predict its protein expression level.

Interestingly, codon bias appears to be an excellent indicator
of the boundaries of current 2D gel proteome analysis
technology. There are thousands of genes with expressed mRNA
and likely expressed protein with codon bias values less than
0.1 (Fig. 4A). In this study, we detected none of them, and only
a very small percentage of the genes detected in this study had
codon bias values between 0.1 and 0.2 (Fig. 4B). Indeed, in
every examined yeast proteome study (5, 7, 13, 28) where the
combined total number of identified proteins is 300 to 400, this
same observation is true. It is expected that for the more
complex cells of higher eukaryotic organisms the detection of
low-abundance proteins would be even more challenging than
for yeast. This indicates that highly abundant, long-lived
proteins are overwhelmingly detected in proteome studies. If
proteome analysis is to provide truly meaningful information
about cellular processes, it must be able to penetrate to the
level of regulatory proteins, including transcription factors and
protein kinases. A promising approach is the use of narrow-range
focusing gels with immobilized pH gradients (IPG) (23).
This would allow for the loading of significantly more protein
per pH unit covered and also provide increased resolution of
proteins with similar electrophoretic mobilities. A standard pH
gradient in an isoelectric focusing gel covers a 7-pH-unit range
(pH 3 to 10) over 18 cm. A narrow-range focusing gel might
expand the range to 0.5 pH units over 18 cm or more. This
could potentially increase by more than 10-fold the number of
proteins that can be detected. Clearly, current proteome
technology is incapable of analyzing low-abundance regulatory
proteins without employing an enrichment method for relatively
low-abundance proteins. In conclusion, this study examined
the relationship between yeast protein and message levels and
revealed that transcript levels provide little predictive value
with respect to the extent of protein expression.